Abstract

Background: The Portable Document Format (PDF) is the most commonly used file format for online scientific 
publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents 
a significant challenge for developers of biomedical text mining or biocuration informatics systems that use 
published literature as an information source. In this paper we introduce the 'Layout-Aware PDF Text Extraction' 
(LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining 
applications. 

Results: Our paper describes the construction and performance of an open source system that extracts text blocks 
from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize 
specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant 
as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, 
such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using 
spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical 
categories using a rule-based method and (3) Stitching classified text blocks together in the correct order 
resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text 
blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also 
present an evaluation of the accuracy of the block detection algorithm used in step 1. Additionally, we have 
compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed 
Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used 
to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of 
improvement. 

Conclusions: LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The 
release of the system is available at http://code.google.com/p/lapdftext/. 



Background and motivation 

The field of Biomedical Natural Language Processing 
(BioNLP) is maturing, with specific fields of software de- 
velopment in response to user requirements: e.g., links 
between databases and literature, better tool interactivity 
and integration and the development of high-quality NLP 
resources [1,2]. NLP techniques such as Named Entity 
Recognition [3] and Semantic Relation Extraction [4] have 
been shown to be very useful to biologists studying pro- 
tein-protein interactions [5] and Gene-Disease-Phenotype 
relations [6]. Given the ubiquity of the 'Portable Document 
Format' (PDF) as a means of distributing scientific 
publications and since access to information in full-text 
documents is vital for developing effective text-mining 
applications [7], it is essential to the general BioNLP com- 
munity that developers of such applications can extract 
the textual content from PDF files accurately with open- 
source tools. Many past biomedical text mining studies 
have used either the abstracts of scientific papers [8-11] or 
relatively small collections of full-text articles sampled 
from the Open Access subset of PubMed Central [12]. It 
is likely that certain content of journals of interest in a 
particular task is not distributed as a part of the Open 
Access subset. 

A long-standing promise of BioNLP has been to help accelerate 
the vital process of literature-based biocuration, 
where published information is carefully translated into the 
knowledge architecture of biomedical databases, using spe- 
cific BioNLP tools [1,8,13]. The identification of all papers 
relevant to the specific database being populated can be 
considered as a document classification problem [12]. Sub- 
sequent steps have been cast as Information Extraction 
(IE) problems that leverage context dependent features 
[14,15]. A key consideration is that the well-crafted manual 
workflows, developed by expert curators in biomedical 
databases, typically use rules based on context and rhet- 
orical structure-dependent clues found only in the full-text 
of an article. Thus, it is important for the developers of 
BioNLP applications to have access to an accurate repre- 
sentation of the full-text of papers derived from PDF files, 
see [16]. 

Our goal is to provide an open-source software mechan- 
ism for automated decomposition and conversion of PDF 
files of research articles into a simple text format that other 
NLP groups can easily incorporate into their toolsets. In 
the most widely used text extraction programs (e.g., Adobe 
Acrobat, Grahl PDF Annotator, IntraPDF, PDFTron and 
PDF2Text), the flow of the main narrative from a file may 
be broken mid-sentence by errors derived from the reading 
order of individual text blocks and by interruptions such 
as the inclusion of figure captions, footnotes and headers. 

The variation in styles and formats of research articles 
(even within a single journal) can cause errors in terms of 
the ordering and splicing of text between pages and 
blocks. Any software that performs such decomposition 
and extraction should be adaptable with minimal human 
effort to new styles and formats. Driven by these needs, 
our system focuses on providing an open source PDF-to- 
text conversion capability meeting the following require- 
ments: (1) the extraction mechanism should be able to 
adapt to single-column, two-column or mixed single and 
double column layouts, (2) extracted text should be error- 
free and grouped according to specific section headings 
used in the paper and (3) formatting artifacts such as 
headers, footers, figures, tables and floating boxes (used in 
author summaries) should not interrupt the narrative flow 
within each section. Thus, we have developed a three-step 
approach for extracting text from PDF files. The first step 
is the identification of contiguous text blocks. The second 
step is the classification of these text blocks into rhetorical 
categories (such as 'Introduction', 'Results' and 'Discussion') 
using logical rules that are easy to generate as 'decision 
tables' in a spreadsheet. The third step utilizes the 
classification results to 'stitch' appropriate text blocks 
together for extracting the text, while ignoring blocks that 
contain formatting embellishments so as to minimize 
flow-disruption of the extracted text. Our system provides 
programmatic, open-source access to each one (or to all 
three) of these capabilities for individual files or large 
collections of files. 

Implementation 

Step 1 - Detecting contiguous text blocks 

The first step in LA-PDFText is to identify contiguous text 
blocks. In addition to the frequently-used two-column and 
single-column formats, journals also often use a mixed for- 
mat where the title, authors, affiliation and abstract span 
the entire page width (single-column format) while all 
other sections of the article use a two-column format. We 
have observed these changes in format by manually 
inspecting papers from all available issues of the journal 
Brain Research. We denote these periodic changes in formatting 
over the lifetime of a given journal as 'epochs'. 

Our approach to detecting contiguous text blocks starts 
with detecting 'word-blocks' (bounding boxes of words). 
We use the GPL version of JPedal, an open-source Java 
PDF library, to obtain the bounding boxes of each word in 
the PDF article (http://www.jpedal.org/). Using this as a 
starting point, LA-PDFText aggregates word-blocks 
systematically to build 'chunk-blocks' of text while respecting 
formatting constraints such as two-column vs. one-column 
formatting. As shown in Figure 1, the algorithm for 
identifying text blocks functions by coalescing word-blocks 
that are close enough together (based on the spatial statistics 
of the words' layout on the page) and share font 
characteristics. The algorithm computes proximity automatically on 
a per-page basis, giving it flexibility in dealing with varying 
formats both within a single page and across pages. 

Figure 1 is an example of how the block detection algorithm 
decides which word-blocks to coalesce. Examples of 
the parameters $\delta w_{horizontal}$, $\delta w_{vertical}$ and $w_{height}$ are shown 
in Figure 1. The distributions of these parameter values are 
calculated for each page and the most popular values for 
these parameters are chosen from these distributions to 
calculate $\varphi_{EW}$ and $\varphi_{NS}$. We intentionally do not use the most 
popular word width, since biomedical text uses many long 
words and the most popular word width would make $\varphi_{EW}$ 
too large, thereby making the block subsumption algorithm 
too greedy. Consider the word 'Introduction' in the section 
heading, the word 'antimicrobial' in the first line of the 
first column and the word 'the' in the third line of the second 
column in Figure 1. Each word-block (shown in red) is 
surrounded by an expanded bounding-box (shown using a 
blue dotted line). All word-blocks (shown in red) that 
intersect with this expanded bounding-box are treated as 
word-blocks to be merged. The block merge procedure is 
a greedy algorithm and will combine a section heading, 
subheading and the section's content into a single block 
based on the $\varphi_{EW}$ and $\varphi_{NS}$ parameters. 
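
The per-page parameter computation and the greedy merge described above can be sketched as follows. This is a minimal illustration in Python, not the LA-PDFText implementation (which is written in Java on top of JPedal). The `Word` record, the function names, the assumption that the horizontal and vertical gaps have already been measured, and the choice to expand each box by the full east-west and north-south parameter on every side are all assumptions made for the example; the font-matching condition is omitted.

```python
from collections import Counter, namedtuple

# Hypothetical word-block record: bounding-box corners in page coordinates.
Word = namedtuple("Word", "x0 y0 x1 y1 text")

def mode(values):
    """Most popular value in a list (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

def page_parameters(words, h_gaps, v_gaps):
    """Return (phi_ew, phi_ns) from the most popular height and gaps on a page."""
    height = mode([w.y1 - w.y0 for w in words])
    return height + mode(h_gaps), height + mode(v_gaps)

def expanded(w, phi_ew, phi_ns):
    """Bounding box of a word grown east-west and north-south."""
    return (w.x0 - phi_ew, w.y0 - phi_ns, w.x1 + phi_ew, w.y1 + phi_ns)

def intersects(box, w):
    """True if the word's bounding box overlaps the given box."""
    bx0, by0, bx1, by1 = box
    return w.x0 <= bx1 and w.x1 >= bx0 and w.y0 <= by1 and w.y1 >= by0

def merge_blocks(words, phi_ew, phi_ns):
    """Greedy grouping: words reachable through expanded boxes share a block."""
    blocks, unassigned = [], list(words)
    while unassigned:
        block = [unassigned.pop(0)]
        frontier = list(block)
        while frontier:
            box = expanded(frontier.pop(), phi_ew, phi_ns)
            near = [w for w in unassigned if intersects(box, w)]
            for w in near:
                unassigned.remove(w)
            block.extend(near)
            frontier.extend(near)
        blocks.append(block)
    return blocks
```

Because the expansion uses the most popular gaps rather than the most popular word width, words within a column are pulled together while a column gutter that is much wider than the usual inter-word space keeps the two columns apart.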

To examine the flexibility of our block detection algorithm, 
we use a PDF file of the Nature editorial in Volume 
466 Issue no. 7303 (Figure 2). This issue contains 3 
editorials, and the second page contains part of the second 
editorial along with the third, separated by a horizontal 
line. LA-PDFText was able to accurately identify, 
classify and extract text from the editorial PDF file. 

Figure 1 Block detection per-page parameter computation algorithm. The image shows the process by which the north-south and east-west 
parameters for neighboring block subsumption are computed. For an explanation of the symbols shown in the figure please see Table 1. 

Step 2 - Classifying text blocks into rhetorical categories 

The next phase of LA-PDFText is based on 'DROOLS', a 
business rule management system with an enhanced rules 
engine implementation, ReteOO, based on the Rete algorithm 
[17] and tailored for the Java language. DROOLS is distributed 
as part of the open-source JBoss Enterprise Platform 
(http://labs.jboss.com/portal/jbossrules/). DROOLS provides a way for 
the LA-PDFText user to declaratively specify characteristics 
of a text block that make it a part of a particular section in 
the paper. We include the rule files for two epochs within 
the PLoS Biology dataset in both the DROOLS format and 
Microsoft Excel format (Additional files 1, 2 and 3). 
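
The idea of a decision table that maps block characteristics to section labels can be sketched in a few lines of Python. This is only an illustration of the rule-based approach, not the DROOLS rule files shipped with LA-PDFText: the feature names (`page`, `font_size`, `bold`, `text`), the thresholds and the labels are hypothetical, and the sketch returns the first matching rule, whereas a Rete-based engine evaluates all rules whose conditions hold.

```python
# Hypothetical decision table: each row is (label, condition over block features).
# Feature names and thresholds are illustrative, not taken from the real rule files.
RULES = [
    ("title",         lambda b: b["page"] == 1 and b["font_size"] >= 18),
    ("heading",       lambda b: b["bold"] and len(b["text"].split()) <= 4),
    ("figure_legend", lambda b: b["text"].startswith("Figure ")),
]

def classify(block, rules=RULES, default="unclassified"):
    """Return the label of the first rule whose condition holds for the block."""
    for label, condition in rules:
        if condition(block):
            return label
    return default
```

Adapting such a table to a new journal epoch would then amount to editing rows rather than code, which is the property the spreadsheet-based rule format is meant to provide.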

Step 3 - Stitching classified text blocks together in the 
correct order 

The final goal of LA-PDFText is to accurately extract 
the text of any given section(s) in the correct sequence. 
To implement this capability, the last component 
of LA-PDFText iterates over the classified 
blocks and stitches them together to 
produce contiguous sections, with section and 
sub-section headings appropriately demarcated. 
LA-PDFText provides mechanisms to output the text of 
these PDF files as XML formatted using PubMed Central's 
OpenAccess DTD. 
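
A minimal sketch of the stitching step is given below, in Python. It assumes that each classified block is a dictionary with hypothetical keys `page`, `column`, `top`, `section` and `text`, and that reading order is page, then column, then vertical position; neither the data structure nor the ordering rule is taken from the LA-PDFText source.

```python
def stitch(blocks, wanted_section):
    """Join the text of blocks labelled `wanted_section` in reading order.

    Blocks with any other label (for example headers or footers) are skipped,
    so they cannot interrupt the extracted narrative.
    """
    selected = [b for b in blocks if b["section"] == wanted_section]
    # Assumed reading order: page, then column, then distance from page top.
    selected.sort(key=lambda b: (b["page"], b["column"], b["top"]))
    return " ".join(b["text"].strip() for b in selected)
```

For example, a 'results' block at the bottom of one page and its continuation at the top of the next are joined directly, while a footer block lying between them on the page is ignored.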

Results 

We have evaluated the three steps of our system inde- 
pendently of each other. In the following sections we 
will present our evaluation methods for each of the three 
steps of LA-PDFText and their results. 

Step 1 - Detecting contiguous text blocks - evaluation 

Table 1 Per page word block parameters symbols and their definitions 

Parameter symbol                                                            Definition 
$w_{height}$                                                                Word block height 
$\delta w_{horizontal}$                                                     Horizontal space between words 
$\delta w_{vertical}$                                                       Vertical space between words 
$\max_i(\delta w_{height})$                                                 Most popular word block height in a page i 
$\max_i(\delta w_{horizontal})$                                             Most popular horizontal space between word blocks in a page i 
$\max_i(\delta w_{vertical})$                                               Most popular vertical space between word blocks in a page i 
$\varphi_{EW} = \max_i(\delta w_{height}) + \max_i(\delta w_{horizontal})$  East-west word block expansion parameter in a page i 
$\varphi_{NS} = \max_i(\delta w_{height}) + \max_i(\delta w_{vertical})$    North-south word block expansion parameter in a page i 

Figure 2 Flexibility of the block identification algorithm. The image shown on the left of the figure is taken from page 2, with two distinct 
articles, of the Nature editorial Volume 466 Issue no. 7303. The image on the right is an example of the debug output generated by LA-PDFText. 
Our block detection algorithm identifies the text blocks in the right column of the article page as distinct blocks, allowing the subsequent block 
classification step of the system to apply rules that treat these blocks as parts of different articles. 

In order to evaluate the effectiveness of spatial segmentation 
of each PDF page into text blocks, we manually segment 
each page in our experimental dataset to produce the 
ideal segmentation of each paper. We then count the number 
of edit operations (deleting and adding blocks) required 
to transform the manually segmented papers into the 
segmentation predicted by our software. The ideal segmentation 
of a paper is one that does not require any deletion, 
addition or splitting of segments in order to retrieve the 
text from the segments in the correct order. We use the 
following guidelines in the manual segmentation process: 
(1) segments should be created in such a way as to facilitate 
sequence-preserving text extraction, (2) segments should 
be rectangular and (3) section headings and sub-headings 
should be marked as distinct segments from the body of 
their corresponding sections. Our algorithm creates images 
of each page of the input PDF showing the word block 
boundaries and the segment boundaries (Figure 2). To 
explain the evaluation process further, we present the 
following sample situations (Table 2) that describe block 
configurations produced by LA-PDFText. In each case, we 
describe edit operations applied to the manually segmented 
page and their corresponding cost. The results of this 
evaluation are presented in Additional file 4: Tables S4, S5, 
S6 and S7 under the column titled 'Spatial Segmentation 
Score'. In the ideal case a paper segmented into blocks by 
LA-PDFText should have a spatial segmentation score of 
zero, indicating that it is perfectly segmented with respect 
to the manual segmentation. 
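
The cost scheme of Table 2 can be written down as a small Python function. This is a sketch under stated assumptions: the mismatch kinds (`split`, `subsume`, `intersect`) are hypothetical names for the three scenarios in Table 2, and treating the score of a paper as the plain sum of the per-mismatch costs is an assumption, since the text only states that edit operations are counted.

```python
def edit_cost(kind, n=0, deleted=0):
    """Cost of one mismatch between system output and gold standard (Table 2)."""
    if kind == "split":        # the system split one gold standard block
        return 1
    if kind == "subsume":      # one big system block covers n gold standard blocks
        return n + 1
    if kind == "intersect":    # gold blocks deleted because system blocks intersect
        return deleted + 1
    raise ValueError(f"unknown mismatch kind: {kind}")

def segmentation_score(mismatches):
    """Sum of edit costs for a paper; 0 means it matches the manual segmentation."""
    return sum(edit_cost(**m) for m in mismatches)
```

A paper with no mismatches scores zero, which corresponds to the ideal case described above.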

Step 2 - Classifying text blocks into rhetorical 
categories - evaluation 

The rule-based segment classifier component of our software 
is instrumented to produce color-coded segments 
depending upon the type of section to which each segment 
belongs. This color-coding is used in the manual 
evaluation to count the number of segments of each section 
that were correctly classified (true positives; TP), 
those that were incorrectly classified (false positives; FP) 
and those that were missed by the rule engine (false 
negatives; FN). Thus, we can calculate the Precision (P), 
Recall (R) and F1 metrics to evaluate the classification 
accuracy using the following formulas: 

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}$$
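
As a worked illustration, precision, recall and F1 (the harmonic mean of precision and recall) can be computed from the TP, FP and FN segment counts as follows. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from segment counts; 0.0 where undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

With 8 correctly classified segments, 2 incorrectly classified and 2 missed, all three values come out at 0.8.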



The results of this manual evaluation are reported in 
Additional file 4: Tables S4, S5, S6 and S7 under the column 
titled 'Block Classification Performance'. Classification 
Precision, Recall and F1 scores are averaged across all 
volumes and presented in Table 3 on a per-section basis. 

Table 2 Example scenarios describing conversion operations and their corresponding costs 

System output                          Operation in gold standard representation                 Cost 
Block is split                         Split the gold standard block into the required number    1 
Big block is subsuming n small blocks  Delete the involved blocks in the gold standard and add   n + 1 
                                       one big block 
n blocks are intersecting              Delete all the blocks in the gold standard whose area is  Number of blocks deleted 
                                       common with the intersecting blocks in the system output  from gold standard + 1 

Ramakrishnan et al. Source Code for Biology and Medicine 2012, 7:7 
http://www.scfbm.org/content/7/1/7 

Step 3 - Stitching classified text blocks together in the 
correct order - Evaluation 

PDF2Text is a widely used approach to extract text from 
PDF files. However, it is unable to distinguish between 
formatting embellishments and the main narrative of a 
scientific article. PDF2Text treats the entire document 
as one string, introducing errors within individual sentences, 
at column breaks and at page breaks. LA-PDFText 
classifies each text block and (provided the classification 
is accurate) stitches text blocks belonging to the same 
section together, in order to extract contiguous rhetorical 
sections of the input articles. We have compared 
the text extraction capabilities of both systems to evaluate 
step 3 of LA-PDFText. Although PDF2Text is a 
simpler tool to use, we evaluate LA-PDFText's text extraction 
capability against that of PDF2Text to show the 
benefit of our three-stage approach to text extraction. 

Table 3 Per-section Precision (P), Recall (R), and F1 scores 
for section classification 

Section                 Part         P      R      F1 
Paper Title                          1.000  0.966  0.983 
Authors                              0.987  0.906  0.945 
Abstract                Heading      1.000  1.000  1.000 
                        Body         0.988  0.883  0.933 
Introduction            Heading      1.000  0.988  0.994 
                        Body         0.876  0.915  0.895 
Results                 Heading      1.000  1.000  1.000 
                        Body         0.948  0.912  0.930 
                        Sub-heading  0.947  0.843  0.892 
Methods                 Heading      1.000  1.000  1.000 
                        Body         0.992  0.927  0.958 
                        Sub-heading  1.000  0.982  0.991 
Discussion              Heading      0.987  1.000  0.993 
                        Body         0.946  0.924  0.935 
                        Sub-heading  0.917  0.885  0.901 
Figure Legend                        0.986  0.840  0.907 
References              Heading      1.000  0.988  0.994 
                        Body         0.532  0.632  0.578 
Supporting Information  Heading      0.988  1.000  0.994 
                        Body         0.946  0.224  0.362 
Macro Average                        0.956  0.888  0.910 

Figure 3 shows an example of the text extraction produced 
by PDF2Text where the string "PLoS Biology | www.plosbiology.org 1" 
interrupts the preceding sentence. The 
interruption is precisely the sort of error that is unacceptable 
in many applications of BioNLP, especially those 
contributing to biocuration. Our evaluation therefore seeks 
to quantitatively capture the notion of 'flow-disruption'. 
Our strategy is based on comparing text extracted by 
PDF2Text and LA-PDFText for a given set of research arti- 
cles, against the text extracted from the XML representa- 
tion of that paper within the Open Access Subset. We 
chose PLoS Biology articles at random from volumes 5, 6, 
7, and 8 for this evaluation. These XML files contain the 
full-text of their corresponding articles, along with the ne- 
cessary markup that demarcates each section of the paper. 
The XML does not contain headers and footers present in 
the original PDF. 

We use a variant of the Needleman-Wunsch algorithm 
[18] to compute alignment costs for text extracted by 
both systems against text obtained from the Open 
Access XML for each paper. The Needleman-Wunsch 
algorithm uses dynamic programming to perform a global 
alignment of two sequences using linear gap 
penalties. Our variant of this algorithm treats the Open 
Access text as a sequence of sentences and computes 
the cost of aligning sentences generated by LA-PDFText 
and PDF2Text with sentences in the Open Access text. 
The algorithm uses a gap penalty of -10, a mismatch 
penalty of -1 and a match reward of 5. Computed alignment 
costs for each paper are normalized by dividing 
them by the number of sentences in the Open Access 
version of the text for that paper. The resulting number 
can be interpreted as the 'average per-sentence alignment 
cost' for a given paper. The difference between 
normalized costs produced by both methods is plotted 
in the graph shown in Figure 4. A number greater than 
zero indicates that LA-PDFText produced a higher 
alignment score with respect to the Open Access text 
than PDF2Text for a particular paper. A number less 
than zero indicates that PDF2Text produced a better 
alignment score. Figure 4 shows that only 7 out of 86 
documents extracted by LA-PDFText (shown using +) 
produce a poorer alignment score with the Open Access 
text than PDF2Text (shown using -). In other words, in 
91% of the cases LA-PDFText outperforms PDF2Text 
(p < 0.001). It should be noted that the text extracted by 
LA-PDFText used in this experiment still contains errors 
introduced by sections that have not been classified 
into any rhetorical category (recall errors). Despite 
these classification errors, LA-PDFText extracts text with 
fewer flow interruptions, resulting in higher accuracy of 
extracted text than PDF2Text. 
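
The sentence-level alignment can be sketched as a standard Needleman-Wunsch dynamic program over two lists of sentences, using the scores quoted above (gap -10, mismatch -1, match 5). This is an illustrative Python sketch rather than the evaluation code used for Figure 4: it assumes that two sentences 'match' only when they are identical strings, and the function names are hypothetical.

```python
def alignment_score(reference, extracted, match=5, mismatch=-1, gap=-10):
    """Global alignment score of two sentence lists (Needleman-Wunsch, linear gaps)."""
    rows, cols = len(reference) + 1, len(extracted) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per sentence.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = reference[i - 1] == extracted[j - 1]
            score[i][j] = max(
                score[i - 1][j - 1] + (match if same else mismatch),
                score[i - 1][j] + gap,   # reference sentence left unaligned
                score[i][j - 1] + gap,   # extra sentence in the extracted text
            )
    return score[-1][-1]

def normalized_score(reference, extracted):
    """Alignment score divided by the number of reference sentences."""
    return alignment_score(reference, extracted) / len(reference)
```

Under this scheme a stray footer sentence inserted into otherwise identical text costs one gap, so the quantity plotted per paper would correspond to `normalized_score(open_access, la_pdftext) - normalized_score(open_access, pdf2text)`, where the three argument names stand for the respective sentence lists.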

Discussion 

LA-PDFText is designed to be a baseline system as a pre- 
cursor for further improvements to the block detection, 
classification and text extraction stages. In this section, we 
discuss the results of each stage of LA-PDFText, present 
error analyses and identify proposed future improvements. 

Figure 3 Text Flow Interruptions. Image (A) in the figure is a snippet of text extracted from the corresponding PDF file (shown in 
image B) by PDF2Text. The red arrows on the extracted text mark a break in text flow generated by PDF2Text owing to its inability to discount 
formatting embellishments like footers. Our evaluation of text extraction accuracy quantifies the effect of such flow-interruption on the quality of 
the output text produced by both PDF2Text and LA-PDFText. 



Step 1 - Detecting contiguous text blocks 

LA-PDFText's block detection algorithm is fairly accurate
(see Spatial Segmentation Score in Additional file 4:
Tables S4, S5, S6 and S7). Over the PLoS Biology dataset,
block detection results in alignment scores with
mean (μ) = 9.5 and standard deviation (σ) = 5.7. The
algorithm depends on the accuracy of JPedal at identifying
word blocks. Although we expect that using a
commercial version of JPedal would reduce these scores
and improve block detection, we want LA-PDFText to
be available for use without the need for users to purchase
the commercial version (although we may release
a version of our system that can also work with the
commercial version of JPedal).

Figure 4 Text Flow Evaluations. The graph above shows the relative alignment cost of LA-PDFText and PDF2Text with respect to the gold
standard. Each green dot represents the difference between the normalized alignment scores of LA-PDFText and PDF2Text for one paper in the
PLoS Biology dataset. + markers show normalized alignment scores produced by LA-PDFText and - markers show normalized alignment scores
produced by PDF2Text. Results indicate that LA-PDFText extracts text with better alignment scores with respect to the gold standard than
PDF2Text for 91% of the documents tested (p < 0.001).
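To make the summary statistics above concrete, the following is a minimal sketch of how a per-document alignment cost and its mean and standard deviation could be computed. It is not the scoring code used in our evaluation: the class name AlignmentCostSketch, the use of plain word-level Levenshtein distance, and the normalization per 100 gold-standard words are all assumptions made for illustration.

```java
final class AlignmentCostSketch {

    /** Splits text on whitespace; a blank string yields no tokens. */
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Word-level Levenshtein distance: each insertion, deletion or substitution costs 1. */
    static int editDistance(String[] a, String[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int substitute = prev[j - 1] + (a[i - 1].equals(b[j - 1]) ? 0 : 1);
                curr[j] = Math.min(substitute, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length];
    }

    /** Assumed normalization: edit operations per 100 gold-standard words. */
    static double normalizedCost(String extracted, String gold) {
        String[] goldWords = tokenize(gold);
        if (goldWords.length == 0) {
            throw new IllegalArgumentException("gold-standard text is empty");
        }
        return 100.0 * editDistance(tokenize(extracted), goldWords) / goldWords.length;
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation (divides by n, not n - 1). */
    static double standardDeviation(double[] values) {
        double mu = mean(values);
        double squared = 0.0;
        for (double v : values) {
            squared += (v - mu) * (v - mu);
        }
        return Math.sqrt(squared / values.length);
    }

    public static void main(String[] args) {
        // Two toy "documents": one with a stray footer word, one extracted perfectly.
        double[] costs = {
            normalizedCost("block one footer block two", "block one block two"),
            normalizedCost("block one block two", "block one block two")
        };
        System.out.println("mean = " + mean(costs) + ", sd = " + standardDeviation(costs));
    }
}
```

Under this assumed measure a lower score means the extracted text is closer to the gold standard, which matches the reading above that better word-block detection should reduce the scores.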

Step 2 - Classifying text blocks into rhetorical categories 

We have designed the segment classification component
of LA-PDFText using a rule-based approach so as to make
the system more flexible and easily adaptable for use with
various journal formats. The classification results (Table 3)
are based on rule files (see Figure 5) that we designed in
roughly a single working day. The goal of our project is to
provide a PDF-extraction library that can be customized
for specific uses by BioNLP developers. Thus, we have
provided a mechanism that requires a relatively small time
investment from developers to classify PDF-based text
blocks with suitable levels of accuracy. The software
distribution includes a Microsoft Excel based 'decision table'
which can be used to fill in values for features of blocks
that cause rules to 'fire' and generate appropriate labels
for blocks. The 'decision table' mechanism will also allow
non-programmers to specify rules for block classification.
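The way a small table of feature conditions turns into block labels can be sketched in plain Java. This is an illustration only and does not use the DROOLS API: the classes Block and Rule, the method classify and the label strings are hypothetical, and the two sample rules simply mirror the 'Title' and 'Authors' rules shown in Figure 5. As in a DROOLS activation group, at most one rule labels a block, and the rule with the highest salience wins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class BlockRuleSketch {

    /** Hypothetical stand-in for the block features that a rule can test. */
    static final class Block {
        final int pageNumber;
        final int mostPopularFontSize;
        final int lineCount;
        final boolean alignedMiddle;
        final String text;

        Block(int pageNumber, int mostPopularFontSize, int lineCount,
              boolean alignedMiddle, String text) {
            this.pageNumber = pageNumber;
            this.mostPopularFontSize = mostPopularFontSize;
            this.lineCount = lineCount;
            this.alignedMiddle = alignedMiddle;
            this.text = text;
        }
    }

    /** One row of a decision table: a label, a priority and a condition. */
    static final class Rule {
        final String label;
        final int salience;
        final Predicate<Block> condition;

        Rule(String label, int salience, Predicate<Block> condition) {
            this.label = label;
            this.salience = salience;
            this.condition = condition;
        }
    }

    private static final Pattern ABSTRACT_START =
            Pattern.compile("^(Summ|[Aa]bst|SUMM|ABST)");

    /** Two rules modelled on Figure 5; the thresholds are copied from that listing. */
    static List<Rule> sampleRules() {
        List<Rule> rules = new ArrayList<Rule>();
        rules.add(new Rule("TITLE", 4, b -> b.pageNumber == 1
                && b.mostPopularFontSize == 20 && b.lineCount <= 6 && b.alignedMiddle));
        rules.add(new Rule("AUTHORS", 4, b -> b.pageNumber == 1
                && b.mostPopularFontSize == 10 && b.lineCount <= 6 && b.alignedMiddle
                && !ABSTRACT_START.matcher(b.text).find()));
        return rules;
    }

    /** Returns the label of the highest-salience matching rule; ties go to the rule listed first. */
    static String classify(Block block, List<Rule> rules) {
        Rule best = null;
        for (Rule rule : rules) {
            if (rule.condition.test(block) && (best == null || rule.salience > best.salience)) {
                best = rule;
            }
        }
        return best == null ? "UNCLASSIFIED" : best.label;
    }

    public static void main(String[] args) {
        Block candidate = new Block(1, 20, 2, true, "A Paper Title");
        System.out.println(classify(candidate, sampleRules()));
    }
}
```

Adapting such a rule set to a new journal format then amounts to editing the numeric thresholds and patterns in each row rather than changing the classification code.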

We have identified specific errors in the rules that
were responsible for poorly performing categories
(Table 3). Within PLoS Biology, the classification recall
for the section titled 'Supporting Information' is only
0.224 (Table 3). Close inspection of our dataset reveals
that most supporting information sections contain figure
legends, which belong to two categories, namely 'Figure
Legends' and 'Supporting Information'. The system
correctly classifies the blocks as figure legends but not as
supporting information. The precision and recall of
the section titled 'References' are 0.532 and 0.632
respectively (Table 3). We attribute the low scores to the
fact that the font used in tables in many papers is the
same as that used in references. Since our baseline rule
set did not contain a rule to identify tables, they are
wrongly identified as references, resulting in poor recall
and precision.



#created on: Jul 30, 2010

package edu.isi.bmkeg.pdf.classification.rules

#list any import classes here.

import edu.isi.bmkeg.pdf.features.ChunkFeatures;
import edu.isi.bmkeg.pdf.model.ChunkBlock;

#declare any global variables here

global ChunkBlock chunk;

rule "Title"
    activation-group "blockClassification"
    salience 4
    when
        ChunkFeatures(pageNumber==1)
        ChunkFeatures(mostPopularFontSize==20)
        eval(chunk.getNumberOfLine()<=6)
        ChunkFeatures(allignedMiddle==true)
    then
        chunk.setType(chunk.TYPE_TITLE);
end

rule "Authors"
    activation-group "blockClassification"
    salience 4
    when
        ChunkFeatures(pageNumber==1)
        ChunkFeatures(mostPopularFontSize==10)
        eval(chunk.getNumberOfLine()<=6)
        ChunkFeatures(allignedMiddle==true)
        features : ChunkFeatures()
        eval(features.isMatchingRegularExpression("^(Summ|[Aa]bst|SUMM|ABST)")==false)
    then
        chunk.setType(chunk.TYPE_AUTHORS);
end

Figure 5 Sample Rule File Listing. The figure shows examples of DROOLS Rules for block classification. DROOLS files meant for two epochs
within the PLoS Biology dataset are available as a part of the software distribution accompanying this paper. They can also be downloaded from
http://code.google.com/p/lapdftext/. The two files included are named epoch_7Jun_8.drl and epoch_5_7May.drl and are located in a folder
called 'rules' in the base directory of the installation. Experiments reported in this paper have been conducted using these rules for the block
classification stage. These files are also included as supplementary material for this paper.
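The effect of blocks that belong to two categories on per-category recall can be illustrated with a small sketch. It assumes, purely for illustration, that each block carries a set of gold-standard labels while the classifier emits a single label; the class CategoryScoreSketch, its method names and the label strings are hypothetical and are not part of the LA-PDFText distribution or of the scoring procedure behind Table 3.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CategoryScoreSketch {

    /** Returns {precision, recall} for one category; gold labels are sets, predictions single labels. */
    static double[] precisionRecall(String category, List<Set<String>> gold, List<String> predicted) {
        int truePositives = 0;
        int falsePositives = 0;
        int falseNegatives = 0;
        for (int i = 0; i < predicted.size(); i++) {
            boolean inGold = gold.get(i).contains(category);
            boolean inPrediction = category.equals(predicted.get(i));
            if (inGold && inPrediction) {
                truePositives++;
            } else if (inPrediction) {
                falsePositives++;
            } else if (inGold) {
                falseNegatives++;
            }
        }
        int predictedCount = truePositives + falsePositives;
        int goldCount = truePositives + falseNegatives;
        double precision = predictedCount == 0 ? 0.0 : (double) truePositives / predictedCount;
        double recall = goldCount == 0 ? 0.0 : (double) truePositives / goldCount;
        return new double[] {precision, recall};
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        return precision + recall == 0.0 ? 0.0 : 2.0 * precision * recall / (precision + recall);
    }

    private static Set<String> labels(String... names) {
        return new HashSet<String>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Toy data: two of three supporting-information blocks are also figure legends.
        List<Set<String>> gold = Arrays.asList(
                labels("SUPPORTING_INFORMATION"),
                labels("SUPPORTING_INFORMATION", "FIGURE_LEGEND"),
                labels("SUPPORTING_INFORMATION", "FIGURE_LEGEND"));
        List<String> predicted = Arrays.asList(
                "SUPPORTING_INFORMATION", "FIGURE_LEGEND", "FIGURE_LEGEND");
        double[] scores = precisionRecall("SUPPORTING_INFORMATION", gold, predicted);
        System.out.println("precision = " + scores[0] + ", recall = " + scores[1]);
    }
}
```

In this toy setting every block labelled as a figure legend counts as a miss for 'Supporting Information', so recall for that category drops even though no block received an incorrect label.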

Step 3 - Stitching classified text blocks together in the correct order

The quality of text extraction is best determined by the
usability of the text by downstream text mining applications.
We have presented evaluations that show the ability
of LA-PDFText to extract text with fewer flow interruptions
than text extracted by PDF2Text. It should be noted
that the evaluation of text extraction was done on the full text
of papers explicitly to contrast LA-PDFText with
PDF2Text. LA-PDFText also provides the user with the
additional capability to extract text on a per-section basis,
a capability that PDF2Text does not support.
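Per-section extraction without flow interruptions can be pictured as follows. This is a minimal sketch under stated assumptions rather than the stitching code in LA-PDFText: the Block fields (page, column, top), the reading order (page, then column, then vertical position measured from the top of the page) and the idea of passing a set of section labels to skip, such as footers, are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StitchSketch {

    /** Hypothetical classified block: position on the page plus its section label. */
    static final class Block {
        final int page;
        final int column;
        final double top;
        final String section;
        final String text;

        Block(int page, int column, double top, String section, String text) {
            this.page = page;
            this.column = column;
            this.top = top;
            this.section = section;
            this.text = text;
        }
    }

    /** Orders blocks by page, column and vertical position, then joins the text of each section. */
    static Map<String, String> stitch(List<Block> blocks, Set<String> skippedSections) {
        List<Block> ordered = new ArrayList<Block>(blocks);
        ordered.sort(Comparator.comparingInt((Block b) -> b.page)
                .thenComparingInt(b -> b.column)
                .thenComparingDouble(b -> b.top));

        Map<String, StringBuilder> buffers = new LinkedHashMap<String, StringBuilder>();
        for (Block block : ordered) {
            if (skippedSections.contains(block.section)) {
                continue; // drop embellishments such as headers and footers
            }
            StringBuilder buffer = buffers.get(block.section);
            if (buffer == null) {
                buffer = new StringBuilder();
                buffers.put(block.section, buffer);
            } else {
                buffer.append(' ');
            }
            buffer.append(block.text.trim());
        }

        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Map.Entry<String, StringBuilder> entry : buffers.entrySet()) {
            result.put(entry.getKey(), entry.getValue().toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(1, 0, 100.0, "introduction", "The flow of text"),
                new Block(1, 0, 780.0, "footer", "Journal footer | page 1"),
                new Block(2, 0, 50.0, "introduction", "continues here."));
        System.out.println(stitch(blocks, Collections.singleton("footer")));
    }
}
```

Because the footer block is filtered out by its label before the text is joined, the sentence that spans the page break is emitted without the kind of interruption shown in Figure 3.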

Related work 

Since the introduction of the Portable Document Format in
1993 and the widespread development of online journals
in the late 1990s, many archival documents published
earlier have been scanned and converted into PDF.
Furthermore, the scientific community and publishers have
adopted PDF as the de facto standard format for scientific
communication. In this paper we therefore do not
focus on the Optical Character Recognition (OCR) problem
but instead assume that we are given PDF documents
that include the text, fonts, images, and 2D
vector graphics. We are primarily concerned with related
work in development of PDF extraction systems that
support BioNLP work in the academic community.

Discovering the logical structure of documents is a
well-studied problem. However, most past efforts were aimed at
logical-structure discovery [19,20] and not explicitly aimed
at text extraction from PDF documents. Furthermore,
these past efforts used OCR on images of document
pages, which were then segmented, and the segments
were classified to discover logical structure. Summers et al.
present a survey of methods for the
document-logical-structure discovery problem [21]. While some of the
surveyed methods perform joint segmentation and
classification, other methods separate these steps into
distinct phases. Certain methods use a multi-level form of
bounding boxes as the basis of their joint segmentation
and decision-tree based classification [19] for
logical-structure discovery. All of the above methods are aimed at
inducing some hierarchical representation of the document
content from document images. The method presented
in this paper uses bounding boxes as well but
separates the segmentation and classification phases.



One recent effort aimed to recover the logical structure
of scholarly articles, using Nuance OmniPage 16 to
identify bounding boxes of words [22]. The bounding box
information is represented in XML that includes markup
indicating each line and paragraph within the input PDF.
The words, lines and paragraph information along with
font information of each word are used as features to train
a Conditional Random Field (CRF) [23] model to classify
each line into one of 23 predetermined classes corresponding
to rhetorical categories. The method proposed in [22]
relies on a commercial tool, a dependency we seek to avoid
here. The authors performed tests on two datasets: one
comprising 40 scientific papers in the field of computer science
and the other, from their previous work, comprising
211 Association for Computing Machinery (ACM)
papers. We downloaded the second dataset (endnote 3) and manually
inspected the PDF documents. We observed that formatting
across the 211 papers from ACM is fairly regular, using
a two-column format. In contrast, we have tested
LA-PDFText on articles from the journal Brain Research
spanning volumes 1 to 1155. We manually identified
10 significant formatting changes from 1966 to 2007. In
order to deal with all articles within PubMed (endnote 4), a PDF
extraction system will have to deal with these formatting
variations. The system developed by [22] also produces XML
similar to the LA-PDFText system and can therefore produce
text on a per-section basis. Upon close inspection of
their results, we observed that formatting embellishments
interrupt the flow of text extracted by their system in much
the same way as in PDF2Text's results. We believe that
this is due to the fact that their system does not use a
rule-based classification of text blocks, and may not be flexible
enough to incorporate this change without substantial effort
in feature engineering and retraining.

PDF extraction was used in the Mouse Genome Informatics
(MGI) system to generate text input for text-mining
software in situ [16]. The authors used a collection of commercial
software (IntraPDF, PDFTron and specifically ProMiner)
to extract text from PDF files but did not describe the
process or outcome in detail, making it difficult to compare
with our current work. Another toolset of particular
interest is the Utopia documents platform [24,25]. Utopia
uses PDF as the base framework for constructing an entire
toolset within the familiar architecture of a paper. As a
first step, the Utopia system performs the text extraction
process with high accuracy, but it does so directly within
the rubric of the Utopia system. Our system is a library
that provides low-level control of multiple components of
the text extraction process and is designed specifically for
use by other text mining developers.

Conclusion & future work 

LA-PDFText is built using non-commercial components,
making it freely available under the LGPL license.
We believe that it is a useful tool for the
BioNLP community owing to its flexibility and adaptability
to a variety of journal formats with minimal
rule-development effort. We plan to extend this work
by extracting text and structure from tables [26],
graphs, figures [27] and citations [28]. The system's
framework is designed in a modular fashion
and can incorporate different methods for block detection
and block classification. LA-PDFText will be
put to immediate use in the development of a variety
of biocuration applications. The next version of
LA-PDFText will output annotations in compliance with
ontologies such as Annotation Ontology [29,30] and
ontologies about bibliographic records, citations, evidence
and discourse relationships.

Software verification 

In addition to the open-source software distribution of
LA-PDFText, we also provide the data set that was used in
the evaluation presented in this paper (see Additional
file 4). During our evaluation process, each phase of our
system's three-stage process produces intermediate files
meant specifically for use by developers to monitor performance.
For instance, the block classification phase
produces an image of each page showing color-coded word
blocks grouped using chunk block bounding boxes. This
has been an invaluable tool for debugging rule files used
in the classification process. Further details about verifying
our system's output are forthcoming at the project
page listed below. Our code contains unit tests that
show how to programmatically invoke our system in all
its modes of operation. We invite the reader to download
the data set from the location indicated in the supplemental
file and reconstruct our evaluation.

Availability and requirements 

Project name: LA-PDFText - Layout-Aware Text Extraction from Full-text PDF of Scientific Articles

Project home page: http://code.google.com/p/lapdftext/

Current Version: 1.7

Operating system: MacOSX 10.6.7, Linux and Windows XP

Programming language: Java 1.6 

Other requirements: none. 

License: GNU General Public License 

Endnotes 

1 Average taken over all class labels in the section classification task

2 http://download.cnet.com/BatchConvert-PDF2Text/3000-2248_4-75147475.html

3 http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus/NUSkeyphraseCorpus.zip

4 http://www.ncbi.nlm.nih.gov/pubmed/

5 https://wiki.birncommunity.org/display/NEWBIRNCC/SciKnowMine/

Additional files 



Additional file 1: Sample block classification rule file 'epoch_5_7May.drl'.
This file contains the rules for block classification for PLoS Biology
articles in issue 5 to the May articles in issue 7 in DROOLS format. This
rule file can be used in conjunction with the LA-PDFText application
available at http://code.google.com/p/lapdftext/

Additional file 2: Sample block classification rule file 'epoch_7Jun_8.drl'.
This file contains the rules for block classification for PLoS Biology articles
in issue 7 from June to those in issue 8 in DROOLS format. This rule file
can be used in conjunction with the LA-PDFText application available at
http://code.google.com/p/lapdftext/

Additional file 3: Sample block classification rule file 'epoch_7Jun_8.csv'.
This file contains the rules for block classification for PLoS Biology
articles in issue 7 from June to those in issue 8 in CSV format. This rule
file can be used in conjunction with the LA-PDFText application available
at http://code.google.com/p/lapdftext/

Additional file 4: Contains supplemental Tables S4, S5, S6 and S7